Prevent allocation of currently allocated resources #1046
Conversation
Just checking in here since the deadline for new packages for the next TOSS release is tomorrow. I think the CI failures are just due to the fact that the expect-failure alloc-check tests need to be updated to expect success.
Happy to push an update here if that would help @milroy.
@grondo: I just got out of meetings and rebased and pushed the update to the testsuite. I'm going to check the two remaining items in the description and drop the WIP tag if possible.
I've confirmed that the exclusivity check here (flux-sched/resource/traversers/dfu_impl.cpp, line 116 at 65da8f2) ensures that everything under a slot must be allocated exclusively. For example, the following simple jobspec:
version: 9999
resources:
  - type: cluster
    count: 1
    with:
      - type: rack
        count: 1
        with:
          - type: node
            count: 1
            with:
              - type: socket
                count: 1
                with:
                  - type: slot
                    label: corelevel
                    count: 1
                    with:
                      - type: core
                        exclusive: false
                        count: 1
# a comment
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "app" ]
    slot: corelevel
    count:
      per_slot: 1

produces the following (overly verbose; we'll need to fix that) match error:

resource-query> m allocate t/data/resource/jobspecs/exclusive/test001.yaml
ERROR: by_excl: exclusivity conflicts at jobspec= : vertex=core0
by_excl: exclusivity conflicts at jobspec= : vertex=core1
[...]

From that standpoint this PR should work as intended. However, is Fluxion conforming to RFC 14 with the jobspec above? In particular, the Reserved Resource Types section (https://flux-framework.readthedocs.io/projects/flux-rfc/en/latest/spec_14.html#reserved-resource-types) states:

The last sentence implies
Honestly I do not know. We don't have command-line tooling that can express that AFAIK. Maybe we could open an issue on that one for later?
My concern is that if it's decided that Fluxion needs to support requests like the one above, then the fix in this PR won't work.
I could be mistaken, but I think jobspec v1 would forbid that type of slot definition. See https://flux-framework.readthedocs.io/projects/flux-rfc/en/latest/spec_25.html#v1-specific-resource-graph-restrictions. flux-core ensures that only v1-compliant jobspecs can be submitted to the scheduler, so... is that sufficient justification for deferring this concern?
It looks like the intention was to make an
Yes, it is. I was a bit confused by the Resource Graph Restrictions section, but I think that, together with:

(implying no other vertices can contain that key), is strong enough.
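To make the exclusivity rule discussed above concrete, here is a minimal, hypothetical C++ sketch (invented names, not the actual code in dfu_impl.cpp) of the idea that a request underneath a slot is forced to be exclusive, so an explicit `exclusive: false` under a slot is rejected as a conflict, which is what the `by_excl` error shown earlier reports.

```cpp
#include <iostream>
#include <optional>

// Exclusivity requested in the jobspec for a resource (hypothetical type).
enum class excl_request { unspecified, exclusive, nonexclusive };

// Resolve the exclusivity to apply when matching a resource request.
// Under a slot the allocation must be exclusive; an explicit
// "exclusive: false" in the jobspec therefore conflicts and cannot be honored.
std::optional<bool> resolve_exclusivity (bool under_slot, excl_request req)
{
    if (under_slot) {
        if (req == excl_request::nonexclusive)
            return std::nullopt;   // conflict: caller reports "exclusivity conflicts"
        return true;               // everything under a slot is exclusive
    }
    // Outside a slot, honor the request; default to non-exclusive.
    return req == excl_request::exclusive;
}

int main ()
{
    // The jobspec above asks for a non-exclusive core inside a slot:
    if (!resolve_exclusivity (true, excl_request::nonexclusive))
        std::cerr << "ERROR: by_excl: exclusivity conflicts at core vertex\n";
    return 0;
}
```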
Problem: issue flux-framework#1043 identified a scenario where Fluxion will grant a new allocation to a job while the resources are still occupied by the previous allocation. The double booking occurs due to the assumption Fluxion makes that a job will not run beyond its walltime. However, as the issue describes, an epilog script may cause a job to run beyond its walltime. Since Fluxion doesn't receive a `free` message until the epilog completes, the allocation remains in the resource graph, but the scheduled point at allocation completion is exceeded, allowing the resources to be allocated to another job. There are other common scenarios that can lead to multiple concurrent allocations, such as a job getting stuck in CLEANUP.

Add a check for an existing allocation on each exclusive resource vertex during graph traversal pruning, applied only to allocation traversals. This prevents another job from receiving the resources while still allowing reservations and satisfiability checks to complete.
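As a rough illustration of the double booking described above, here is a small, hypothetical C++ sketch (invented types and names, not Fluxion's actual data structures): a purely time-based availability check treats a vertex as free once its walltime-derived completion point has passed, even though the allocation has not yet been released by a `free` message.

```cpp
#include <cstdint>
#include <iostream>
#include <map>

// Hypothetical per-vertex bookkeeping: jobid -> scheduled completion time.
struct vertex_sched {
    std::map<int64_t, int64_t> allocations;
};

// Old behavior (sketch): availability judged only by scheduled completion times.
bool available_by_time (const vertex_sched &s, int64_t now)
{
    for (const auto &kv : s.allocations)
        if (kv.second > now)
            return false;
    return true;
}

// Fixed behavior (sketch): an exclusive vertex with any remaining allocation is
// not available for a new allocation until `free` removes the jobid.
bool available_for_alloc (const vertex_sched &s, int64_t now)
{
    return available_by_time (s, now) && s.allocations.empty ();
}

int main ()
{
    vertex_sched core;
    core.allocations[42] = 100;   // job 42 was scheduled to finish at t=100
    int64_t now = 150;            // ...but at t=150 its epilog is still running

    std::cout << "time-based check: " << available_by_time (core, now) << "\n";   // 1 -> double booking
    std::cout << "allocation check: " << available_for_alloc (core, now) << "\n"; // 0 -> blocked
    return 0;
}
```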
Problem: flux-framework#1044 introduced tests to check for allocation failures due to jobs that exceed their time limit. PR flux-framework#1044 had to define the two `alloc-check` tests as `test_expect_failure` because the mechanism to prevent multiple booking was not in place.

Update the two `alloc-check` tests to `test_expect_success`, since the PR containing this commit fixes multiple booking in Fluxion.
Codecov Report
@@ Coverage Diff @@
## master #1046 +/- ##
======================================
Coverage 74.3% 74.4%
======================================
Files 86 86
Lines 9432 9434 +2
======================================
+ Hits 7017 7022 +5
+ Misses 2415 2412 -3
After inspecting the auxiliary subsystem traversals and testing the fix in PR #1041, I'm satisfied that this fix will likely work with (and shouldn't break) the aux traversals. Removing WIP and requesting an additional review.
LGTM!
Just gave this a quick sanity check on my test cluster with the artificially extended prolog/epilog and it seems to work well. (Not too different from the sharness test so maybe that doesn't add much).
LGTM too! Perhaps @jameshcorbett should add his review in case there is something here that might affect rabbit allocations.
Yeah, this looks good. We may end up deciding to tweak it as part of the traversal refactor I want to do, but it should be easier to express after that too. Really good idea @milroy!
Looks reasonable to me!
No problem @garlick, and let me know if additional details would help. I didn't include a discussion of why handling early termination is much easier than handling jobs that run longer than their walltime. Let me know if I should configure this PR to close issue #1043.
Thanks all for the discussion and reviews! Setting MWP.
Problem: issue #1043 identified a scenario where Fluxion will grant a new allocation to a job while the resources are still occupied by the previous allocation. The double booking occurs due to the assumption Fluxion makes that a job will not run beyond its walltime. However, as the issue describes, an epilog script may cause a job to run beyond its walltime. Since Fluxion doesn't receive a `free` message until the epilog completes, the allocation remains in the resource graph, but the scheduled point at allocation completion time is exceeded, allowing the resources to be allocated to another job. There are other common scenarios that can lead to multiple concurrent allocations, such as a job getting stuck in `CLEANUP`.

This PR adds a check for an existing allocation on each exclusive resource vertex during traversal pruning. The check only occurs for `AT_ALLOC` traversals, which allows reservation and satisfiability traversals to succeed. The check prevents another job from receiving the resources and blocks `AT_ALLOC` traversals from matching the resources until the jobid is removed from the per-vertex `allocations` map via a `free` message from `job-manager`.

The check has the (I think) unavoidable side effect that reservations on the allocated subgraph will be allowed immediately after the `at` time of the `AT_ALLOC` traversal, because the `planner_multi_avail_time_first` call will regard extra resources as available via incorrect aggregate counts. The side effect is minor, though, as the reservation will be blocked from transitioning to allocated until a `free` message cancels the existing allocation. During subsequent scheduling loops, if the `AT_ALLOC` traversal fails, the reservation will be pushed back during the `AT_ALLOC_ORELSE_RESERVE` traversal. That will continue until matching resources become available.

The PR fixes the reproducer from flux-framework/flux-core#5304 and passes the test in PR #1044, but it's WIP because the behavior for non-exclusive resource requests is still suspect.

I chose to place the check in the `by_excl` part of `prune` because there can only be a single allocation in `(*m_graph)[u].schedule.allocations` for an exclusive vertex. While checking at non-exclusively allocated vertices is possible, the logic for determining which jobid is delayed (by comparing the availability counts returned by `planner` to the jobs in the `(*m_graph)[u].schedule.allocations` map and then determining the true number of available resources) is much more complex.

Only checking at exclusive vertices can involve a deeper traversal than necessary (i.e., deeper than if the check were made at higher, non-exclusive vertices), but I think it will correctly block allocations provided that leaf-level vertices must be allocated exclusively. That appears to be ensured by the exclusivity check in flux-sched/resource/traversers/dfu_impl.cpp, line 116 at 65da8f2.
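For readers who want to see where this sits conceptually, here is a minimal, self-contained C++ sketch of the pruning condition. All types, names, and signatures below are hypothetical stand-ins; the real logic lives in the `by_excl` portion of `prune` in `resource/traversers/dfu_impl.cpp` and looks different.

```cpp
#include <cstdint>
#include <map>

// Hypothetical traversal types; flux-sched's real enum differs.
enum class traversal_op { AT_ALLOC, AT_ALLOC_ORELSE_RESERVE, SATISFIABILITY };

// Hypothetical vertex bookkeeping mirroring (*m_graph)[u].schedule.allocations:
// an exclusively allocated vertex holds at most one entry here.
struct vertex_sched {
    std::map<int64_t, int64_t> allocations;   // jobid -> allocation span/at
};

// Return true if this vertex should be pruned from the current traversal.
// The added condition: during an AT_ALLOC traversal, an exclusive vertex that
// still carries an allocation is skipped, even if its walltime-based scheduled
// point has passed.  Reservation and satisfiability traversals are untouched,
// so they can still complete; the vertex only clears when a `free` message
// removes the jobid from the allocations map.
bool prune_allocated_exclusive (const vertex_sched &v, bool exclusive, traversal_op op)
{
    if (!exclusive || op != traversal_op::AT_ALLOC)
        return false;
    return !v.allocations.empty ();
}
```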
There are two items to verify before the WIP can be dropped:

- Update `test_expect_failure` to `test_expect_success` in `t/t1024-alloc-check.t` after PR #1044 (testsuite: add test for double-booking) is merged.

This PR should be merged after PR #1044.